
    Summary Management in P2P Systems

    International audience. Sharing huge, massively distributed databases in P2P systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large P2P systems. In this paper, we consider summaries that are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. Our main contribution is to define a summary model for P2P systems, and the appropriate algorithms for summary management. Our performance evaluation shows that the cost of query routing is minimized while the cost of summary maintenance remains low.

    Joining Distributed Database Summaries

    The SaintEtiQ database summarization system provides multi-level summaries of tabular data stored in a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, in many companies, data are distributed among several sites, either homogeneously (i.e., sites contain data for a common set of features) or heterogeneously (i.e., sites contain data for different features). Consequently, the current centralized version of SaintEtiQ is either infeasible or undesirable because of privacy or resource issues. In this paper, we propose two new algorithms for summarizing heterogeneously distributed data without a prior "unification" of the data sources: Subspace-Oriented Join Algorithm (SOJA) and Tree Alignement-based Join Algorithm (TAJA). The main idea of these algorithms is to apply innovative joins on two local models, computed over two disjoint sets of features, to provide a global summary over the full feature set without scanning the raw data. SOJA takes one of the two input trees as the base model and processes the other to complete it, whereas TAJA rearranges summaries level by level in a top-down manner. Then, we propose a consistent quality measure to quantify how good our joined hierarchies are. Finally, an experimental study using synthetic data sets shows that our joining processes (SOJA and TAJA) produce high-quality clustering schemas of the entire distributed data and are very efficient in terms of computational time compared with the centralized approach.

    Peersum : Gestion des résumés de données dans les systèmes P2P

    Base de Données Avancées (BDA). National audience. Sharing huge, massively distributed databases in P2P systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large P2P systems. In this paper, we consider summaries that are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. The main contribution of this paper is to define an efficient algorithm for partitioning an unstructured P2P network into domains, in order to optimally distribute summaries in the network. Then, we propose a distributed algorithm for maintaining a summary in a given domain. Our performance evaluation shows that the cost of query routing is minimized while the cost of summary maintenance remains low.

    PeerSum: a Summary Service for P2P Applications

    International audience. Sharing huge databases in distributed systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large distributed systems. In this paper, we propose PeerSum, a new service for managing summaries over shared data in large P2P and Grid applications. Our summaries are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. Our main contribution is to define a summary model for P2P systems, and the algorithms for summary management. Our performance evaluation shows that the cost of query routing is minimized while the cost of summary maintenance remains low.

    Gestion de résumés de données dans les systèmes pair–pair

    International audience. In this paper, we propose managing data summaries in unstructured P2P systems. Our summaries are intelligible views with two main virtues. First, they can be directly queried and used to approximately answer a query. Second, as semantic indexes, they support locating relevant nodes based on data content. The performance evaluation of our proposal shows that the cost of query routing is minimized while the cost of summary maintenance remains low.

    Data Sharing in P2P Systems

    To appear in Springer's "Handbook of P2P Networking". In this chapter, we survey P2P data sharing systems. Throughout, we focus on the evolution from simple file-sharing systems, with limited functionality, to Peer Data Management Systems (PDMS) that support advanced applications with more sophisticated data management techniques. Advanced P2P applications deal with semantically rich data (e.g. XML documents, relational tables), using a high-level SQL-like query language. We start our survey with an overview of the existing P2P network architectures and the associated routing protocols. Then, we discuss data indexing techniques based on their degree of distribution and the semantics they can capture from the underlying data. We also discuss schema management techniques, which allow integrating heterogeneous data. We conclude by discussing the techniques proposed for processing complex queries (e.g. range and join queries). Complex query facilities are necessary for advanced applications that require a high level of search expressiveness. This last part shows the lack of querying techniques that allow for approximate query answering.

    Multi-Dimensional Grid-Based Clustering of Fuzzy Query Results

    In typical retrieval processes over large databases, the user first formulates a basic (broad) query to target and filter data, and then browses the answer looking for precise information. We therefore propose to perform an offline hierarchical grid-based clustering of the data set in order to quickly provide the user with concise, useful and structured answers as a starting point for an online exploration. Every single answer item describes a subset of the queried data in a user-friendly form using linguistic labels; that is, it represents a concept that exists within the data. Moreover, the answers to a given 'blind' query are nodes of a classification tree, and every subtree rooted at an answer offers the user a 'guided tour' of a data subset. Finally, an experimental study shows that our process is efficient in terms of computational time and achieves high-quality clustering schemas of query results.

    Design of PeerSum: a Summary Service for P2P Applications

    International audience. Sharing huge databases in distributed systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A more efficient approach is to rely on compact database summaries rather than raw database records, whose access is costly in large distributed systems. In this paper, we propose PeerSum, a new service for managing summaries over shared data in large P2P and Grid applications. Our summaries are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. Our main contribution is to define a summary model for P2P systems, and the algorithms for summary management. Our performance evaluation shows that the cost of query routing is minimized while the cost of summary maintenance remains low.

    Organizing Gaussian mixture models into a tree for scaling up speaker retrieval

    International audience. Numerous pattern recognition tasks set in the probabilistic framework face the following issue: it is expensive to evaluate the likelihood function for test data when very many candidate probabilistic models are available for explaining this data. We consider the application of this general and important problem to speaker recognition for indexing and retrieval purposes in radio archives. More precisely, we propose to reduce complexity at query time by organizing the speaker models into a hierarchy beforehand. This is classically done for multi-dimensional vectors, but we propose herein a technique for building a hierarchy of probabilistic models in the case where these models take the form of a Gaussian mixture. From a closed-form approximation of the Kullback-Leibler divergence between parent and children, an optimality criterion and an optimization technique are derived, from which we propose an efficient approach for building a tree of models using clustering techniques (dendrogram-based or k-means-like). The proposed scheme is evaluated on real data.